On Clustering and Evaluation of Narrow Domain Short-Text Corpora

نویسنده

  • David Pinto
چکیده

PhD thesis in Computer Science written by David Eduardo Pinto Avendaño under the supervision of Paolo Rosso (Univ. Politécnica de Valencia) and Héctor Jiménez (Univ. Autónoma Metropolitana, México). The author was examined in July 2008 in Valencia by the following committee: Manuel Palomar Sanz (Univ. de Alicante), Alfonso Ureña López (Univ. de Jaén), Eneko Agirre (Univ. del Páıs Vasco), Benno Stein (Weimar Univ., Germany) and Encarna Segarra Soriano (Univ. Politécnica de Valencia). The grade obtained was Sobresaliente Cum Laude.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Density-based clustering of short-text corpora∗ Agupamiento de textos cortos basado en densidad

In this work, we analyse the performance of different density-based algorithms on short-text and narrow domain short-text corpora. We attempt to determine to what extent the features of this kind of corpora impact on the density computation of the clusterings obtained and how robust these algorithms to the different complexity levels are.

متن کامل

Characterizing Weblog Corpora

In order to exploit the huge volume of information being published in the blogosphere, it is essential to provide techniques such as clustering, which can automatically analyze and classify their contents. However these typically can produce better results when dealing with wide domain full-text documents. In most cases however, blogs can be considered to be “short texts”, i.e., they are not ex...

متن کامل

خوشه‌بندی اسناد مبتنی بر آنتولوژی و رویکرد فازی

Data mining, also known as knowledge discovery in database, is the process to discover unknown knowledge from a large amount of data. Text mining is to apply data mining techniques to extract knowledge from unstructured text. Text clustering is one of important techniques of text mining, which is the unsupervised classification of similar documents into different groups. The most important step...

متن کامل

Evaluation of Internal Validity Measures in Short-Text Corpora

Short texts clustering is one of the most difficult tasks in natural language processing due to the low frequencies of the document terms. We are interested in analysing these kind of corpora in order to develop novel techniques that may be used to improve results obtained by classical clustering algorithms. In this paper we are presenting an evaluation of different internal clustering validity...

متن کامل

Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance

Clustering short length texts is a difficult task itself, but adding the narrow domain characteristic poses an additional challenge for current clustering methods. We addressed this problem with the use of a new measure of distance between documents which is based on the symmetric Kullback-Leibler distance. Although this measure is commonly used to calculate a distance between two probability d...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Procesamiento del Lenguaje Natural

دوره 42  شماره 

صفحات  -

تاریخ انتشار 2009